Assignment 2(A): Applying Decision Tree and Random Forest Algorithms to Automobile Loan Evaluation

Acknowledgment

You are required to acknowledge the following statement by entering your full name, SID, and date below:

"By continuing to work on or submit this deliverable, I acknowledge that my submission is entirely my independent original work done exclusively for this assessment item. I agree to:

Furthermore, I acknowledge that I have not engaged and will not engage in any activities that dishonestly improve my results or dishonestly improve/hurt the results of others, and that I abide by all academic honor codes set by the University."

Your full name:
Pang Fong Chun
Your SID:
3035281299
Date:
24-06-2022

1. Introduction

In this part of the assignment, you will implement the decision tree and random forest learning algorithms, and use the models learned by these algorithms to make decisions on automobile loan applications. You are required to complete the lines between START YOUR CODE HERE and END YOUR CODE HERE (if applicable) and to execute each cell. Within each coding block, you are required to enter your code to replace None after the = sign (except where otherwise stated). You are not allowed to use libraries or files other than those provided in this assignment. When entering your code, you should not change the names of variables, constants, and functions already listed.

Contents

Before we begin the exercises, we need to import all required libraries.

2. Automobile Loan Dataset

2.1. Data Description

The dataset includes 7,180 records of automobile (used cars) loan applications processed by a bank. Each record is described by 27 features as listed below (an additional unnamed ID (first column) is not listed). The feature LoanStatus (approved or denied) is the target feature. The text file named raw_classification_data.csv stores each record as one row having the feature values separated by commas.

Feature Description
ModifiedCreditScore Greater of the Credit score and Co-Credit Score.
ModifiedBankruptcyScore Greater of the Bankruptcy score and Co-Bankruptcy Score.
EmployedMonths Stated number of months that the applicant has been employed with their current employer.
TotalMonthlyIncome Sum of the applicants and the co-applicants monthly income.
PrimeMonthlyLiability Stated non-rent liabilities of applicant.
PrimeMonthlyRent Applicant's stated monthly housing expense.
TotalMonthlyDebtBeforeLoan Sum of applicant and co-applicants housing payments and liabilities.
VehicleYear Year of the vehicle the applicant is looking to purchase.
VehicleMileage Number of miles on the vehicle the applicant is looking to purchase.
TotalVehicleValue Amount the vehicle is being sold for.
AmountRequested Amount the applicant is requesting to borrow.
DownPayment Amount of money the applicant is paying upfront toward the vehicle loan.
Loanterm Number of months applicant has to pay loan off.
OccupancyDuration Stated number of months the applicant has been in their current residence at the time of the application.
EstimatedMonthlyPayment Estimated monthly payment based on loan amount, interest rate, and loan term.
NumberOfOpenRevolvingAccounts Count of revolving accounts that appear on the applicant's credit report.
LTV Vehicle's loan to value ratio.
DTI Applicant's debt to income ratio based on credit report and loan type.
Source Identifies channel from which application was received.
EmploymentStatus Indicates if the applicant was employed at the time application was submitted.
VehicleMake Make of the vehicle the applicant is looking to purchase.
isNewVehicle Indicates if the vehicle the applicant is looking to purchase is new or used.
OccupancyStatus Stated occupancy status of the applicant at the time of the application.
RequestType Type of vehicle loan requested by the applicant (Refinance, lease buyout, etc.)
MemberIndicator Indicates if applicant was a bank member before applying for loan
CoApplicantIndicator Indicates whether or not a co-applicant is present on the application
LoanStatus Indicates whether loan was approved or denied

2.2. Data Loading

In this section, you use the pandas functions read_csv() to load the dataset, info() to generate a summary, and drop() to drop the first unnamed ID column. You can optionally use head() to display the first several records.
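The loading steps above can be sketched as follows. A small in-memory CSV (with illustrative column names) stands in for the data file so the pattern runs anywhere; in the notebook itself you would pass the filename raw_classification_data.csv to read_csv().

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for raw_classification_data.csv;
# only the two column names below are illustrative.
csv_text = ",LTV,LoanStatus\n0,95.2,Approved\n1,120.4,Declined\n"

data = pd.read_csv(io.StringIO(csv_text))  # notebook: pd.read_csv("raw_classification_data.csv")
data.info()                                # summary of dtypes and non-null counts
data = data.drop(columns=["Unnamed: 0"])   # drop the unnamed ID column
print(data.head())                         # optionally inspect the first records
```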

2.3. Data Visualization

You can visualize the distribution of each feature by executing the following code block. All numeric (continuous) features are visualized by blue bars, whereas all categorical features are visualized by red bars.

2.4. Data Processing

Compared with other learning algorithms such as linear and logistic regression, the decision tree algorithm requires only simple data pre-processing procedures:

  1. separate the whole dataset into selected features and classification labels
  2. split it into training ($70\%$) and testing ($30\%$) datasets with train_test_split() from the scikit-learn library.

Note that one-hot encoding is not necessary in this task.
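The two steps above can be sketched as below, assuming the loaded DataFrame is named data and the target column is LoanStatus (names taken from the text); the tiny stand-in dataset is illustrative only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny stand-in for the loaded DataFrame; column names follow the text.
data = pd.DataFrame({
    "LTV": [95.2, 120.4, 80.1, 110.0, 99.9, 70.3, 101.2, 88.8, 130.1, 92.0],
    "LoanStatus": ["Approved", "Declined"] * 5,
})

# 1. separate the features (X) from the classification label (y)
X = data.drop(columns=["LoanStatus"])
y = data["LoanStatus"]

# 2. 70/30 train/test split; a fixed random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
print(len(X_train), len(X_test))
```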

3. Decision Tree

3.1. Tree Node

When constructing the decision tree, the Node class is used to represent the tree structure. This class is initialized by the __init__() function, a built-in magic method of a class. Five necessary variables are passed to the Node class as parameters:

Task 1:

  1. assign isLeaf to self.isLeaf (1 line)
  2. assign label to self.label (1 line)
  3. assign threshold to self.threshold (1 line)

[Test Block 1]: Test code for class Node() to show the node definitions and how to query its variables.
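A minimal sketch of such a Node class follows. Only isLeaf, label, and threshold are named in Task 1; the children parameter here is an assumption about the remaining variables, so the notebook's actual signature may differ.

```python
class Node:
    def __init__(self, isLeaf, label, threshold, children=None):
        self.isLeaf = isLeaf        # 1. True if this node is a leaf
        self.label = label          # 2. class label (leaf) or splitting attribute name
        self.threshold = threshold  # 3. split threshold for continuous attributes, else None
        self.children = children if children is not None else []  # assumed extra parameter

# Querying the variables of a leaf node:
leaf = Node(isLeaf=True, label="Approved", threshold=None)
print(leaf.isLeaf, leaf.label, leaf.threshold)
```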

3.2. Attribute Selection Measure

Now, you will implement the entropy measure to select the best attribute among all features to split a node.

3.2.1. Entropy

Entropy measures the impurity in a dataset and was originally developed by Shannon (1949):

$$Entropy(S) = -\sum_{i=1}^c p_i \log_2{p_i} \ . \tag{1}$$

Here, $c$ is the number of classes and $p_i$ is the probability associated with the $i$-th class. For example, if 80 of 100 data records in $S$ have LoanStatus="Declined" and 20 of the 100 records have LoanStatus="Approved", then the entropy is $-0.8 * \log_2(0.8) - 0.2 * \log_2(0.2) = 0.72193.$

Task 2: You will

  1. remove zero values with conditional indexing, e.g., if a=np.array([1,3,-4,1]), then a[a!=1] returns array([3, -4]) (it removes from a the elements equal to $1$). Save the result in type_counts_no_zero. (1 line)
  2. compute the proportion of data instances having the specific feature value. Save the result in proportion. (1 line)
  3. calculate an entropy element of a feature value based on Equation (1). Save the result in "ent_element". The function np.log2(x) returns the base-2 logarithm of the input x. (1 line)
  4. add the entropy element to the total entropy value ent. (1 line)
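The four steps above can be sketched as below, assuming entropy() receives an array of per-class counts (the exact signature in the notebook may differ).

```python
import numpy as np

def entropy(type_counts):
    """Entropy of a node given the count of records in each class (Eq. 1)."""
    type_counts = np.asarray(type_counts, dtype=float)
    # 1. remove zero counts by conditional indexing (log2(0) is undefined)
    type_counts_no_zero = type_counts[type_counts != 0]
    ent = 0.0
    for count in type_counts_no_zero:
        # 2. proportion p_i of records in this class
        proportion = count / type_counts_no_zero.sum()
        # 3. entropy element -p_i * log2(p_i)
        ent_element = -proportion * np.log2(proportion)
        # 4. accumulate into the total entropy
        ent += ent_element
    return ent

print(entropy([80, 20]))  # entropy of an 80/20 split
```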

[Test Block 2]: Test code for function entropy().

3.2.2. Information gain

You will implement the function compute_gain() to compute the information gain based on the computation of entropy you implemented above. The information gain can be represented by Equation (2) that is used in the ID3 algorithm:

$$ Gain(S, A) \equiv Entropy(S) - \sum_{v\in Values(A)}\frac{|S_v|}{|S|}Entropy(S_v) \ . \tag{2}$$

This is the information gain (a.k.a. mutual information) of an attribute $A$ in the dataset $S$. $|\cdot|$ denotes the number of data instances in the sub-dataset $S_v$ partitioned from $S$, in which $v$ is in the set of all possible values of $A$.

Task 3: You will

  1. calculate the entropy of "unionset" before the split with your implemented function entropy(), and assign the result to impurityBeforeSplit. (1 line)

In each iteration of the for loop, you compute the following values for each node (subset of data instances) branched out from a parent node:

  1. the size of the subset (number of data instances) with shape[x], in which x is the dimension you need to fill in. In a 2-dimensional dataset, the first dimension (0) is the number of data instances; the second dimension (1) is the number of features. Then, save it in subset_num. (1 line)
  2. the entropy of each subset with your implemented entropy(), save it in subset_entropy. (1 line)
  3. the weight $\frac{|S_v|}{|S|}$ of each subset in all subsets according to Equation (2), save it in weight. (1 line)
  4. the impurity after split of this subset (multiply weight and subset entropy), add it to impurityAfterSplit. (1 line)
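The steps above can be sketched as a standalone version of Equation (2), under the assumption that compute_gain() receives per-class counts for the union set and for each branch (the notebook version operates on the datasets themselves).

```python
import numpy as np

def entropy(counts):
    counts = np.asarray(counts, dtype=float)
    counts = counts[counts != 0]
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def compute_gain(union_counts, subset_counts_list):
    impurityBeforeSplit = entropy(union_counts)          # Entropy(S) before the split
    total = sum(np.sum(c) for c in subset_counts_list)   # |S|
    impurityAfterSplit = 0.0
    for counts in subset_counts_list:
        subset_num = np.sum(counts)                      # 1. |S_v|, size of the subset
        subset_entropy = entropy(counts)                 # 2. Entropy(S_v)
        weight = subset_num / total                      # 3. |S_v| / |S|
        impurityAfterSplit += weight * subset_entropy    # 4. weighted impurity
    return impurityBeforeSplit - impurityAfterSplit

# Splitting an 80/20 node into a pure branch and a 50/50 branch:
print(compute_gain([80, 20], [[60, 0], [20, 20]]))
```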

[Test Block 3]: Test code for function compute_gain().

3.3. Best Attribute Selection

You will implement the split_attribute() function to select the best attribute to split each node. When considering a categorical feature, the node will be split by the feature values using the ID3 algorithm. When considering a continuous feature, the C4.5 algorithm will be used.

Task 4: You will

For a categorical feature:

  1. compute the information gain of the categorical attribute with your implemented compute_gain(). (1 line)

For a continuous feature:

  1. sort the data records by attribute value in ascending order with sort_values() of "curData", a pandas DataFrame (1 line)
  2. compute the partition point threshold between two consecutive records by $\mbox{threshold}=\frac{v_i + v_{i+1}}{2}$. You should use loc[a, b] of sorted_data to get the specific data item, where a is the row index and b is the name of the attribute. Please replace a and b with the corresponding variables (1 line)
  3. separate the data by the selected "threshold", saving records larger than the threshold into greater_data and records smaller than it into less_data (2 lines)
  4. save less_data and greater_data (in this order) in subsets as a list, e.g., subsets is [a, b], where a and b are datasets (1 line)
  5. compute the information gain of the continuous attribute with your implemented compute_gain() (1 line)
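The continuous-attribute steps can be sketched as follows. The column name is hypothetical, and whether records equal to the threshold go into less_data is an assumption here.

```python
import pandas as pd

# Hypothetical numeric column standing in for a continuous attribute.
curData = pd.DataFrame({"LTV": [110.0, 80.0, 95.0, 120.0]})
attribute = "LTV"

# 1. sort ascending; reset_index keeps .loc row labels consecutive
sorted_data = curData.sort_values(by=attribute).reset_index(drop=True)

i = 1  # candidate partition point between rows i and i + 1
# 2. midpoint of two consecutive attribute values
threshold = (sorted_data.loc[i, attribute] + sorted_data.loc[i + 1, attribute]) / 2

# 3. partition the data (equality handling is an assumption)
less_data = sorted_data[sorted_data[attribute] <= threshold]
greater_data = sorted_data[sorted_data[attribute] > threshold]

# 4. keep the order [less_data, greater_data]
subsets = [less_data, greater_data]
print(threshold)
```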

[Test Block 4]: Test code for function split_attribute().

3.4. Recursive Tree Construction

You will use a recursive approach to generate the decision tree. You need to pay attention to when the recursion of tree construction stops. (There are three possibilities to stop the recursion and return a tree node.)

Task 5:

  1. extract the key(s) of the dictionary "class_dict" and save them in "keys" (1 line)
  2. get the class from "keys": convert "keys" to a list with the function list() and get the current label at the corresponding index (1 line)
  3. generate a new leaf node with class Node() (1 line)
  4. get the majority class from the dictionary "class_dict" using the function max(). Note that max() takes a key= input, and the dictionary's get() function is a good candidate for that input. For more details, see the Python documentation of max(). (1 line)
  5. generate a new leaf node with class Node() (1 line)
  6. build the children nodes recursively using generate_tree(). Please use a list comprehension, changing the "None" value in it, to generate the list of children nodes. (1 line)
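Steps 1, 2, and 4 can be sketched as below, assuming class_dict maps each class label to its count (the values here are hypothetical).

```python
# Hypothetical class-count dictionary for one node.
class_dict = {"Approved": 30, "Declined": 70}

keys = class_dict.keys()                           # 1. extract the keys
first_label = list(keys)[0]                        # 2. convert to a list and index it
major_class = max(class_dict, key=class_dict.get)  # 4. label with the largest count
print(first_label, major_class)
```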

3.5. Predictive Function

You will use the decision tree model ("tree_node") generated by generate_tree() to predict the label of each testing data instance. The outputs of this function predict() are "Approved", "Declined", and "Fail". Note that "Fail" occurs when the query data contains a feature value that the decision tree has not seen.

Task 6: Check whether the current tree node splits on a categorical or a continuous attribute.

For a categorical attribute:

  1. if the child is a leaf node, directly return its "label" variable (1 line)
  2. if not, call the function tree_predict() recursively (1 line)

For a continuous attribute:

  1. compare the value query[node.label] with the node's threshold and select the corresponding child (2 lines)
  2. if the child is a leaf node, directly return its "label" variable (1 line)
  3. if not, call the function tree_predict() recursively (1 line)
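A compact sketch of the recursion for the continuous case follows (the categorical branch and the "Fail" path are omitted, and which child handles values below the threshold is an assumption).

```python
# Simplified Node mirroring the attributes named in the text.
class Node:
    def __init__(self, isLeaf, label, threshold=None, children=None):
        self.isLeaf = isLeaf
        self.label = label          # class label (leaf) or attribute name (internal)
        self.threshold = threshold
        self.children = children or []

def tree_predict(node, query):
    if node.isLeaf:
        return node.label                     # leaf: return its label directly
    # continuous split: assume children[0] handles values <= threshold
    if query[node.label] <= node.threshold:
        child = node.children[0]
    else:
        child = node.children[1]
    return tree_predict(child, query)         # recurse into the selected child

tree = Node(False, "LTV", threshold=100.0,
            children=[Node(True, "Approved"), Node(True, "Declined")])
print(tree_predict(tree, {"LTV": 92.5}))
```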

3.6. Tree Generation and Evaluation

Tree generation

Now, you will use the training dataset data_train to generate a decision tree with your implemented function generate_tree(). All features on training data will be included.

Tree Performance Evaluation

To evaluate the performance of trained decision tree, your implemented function tree_predict() is used to test each data instance in the testing dataset.

Task 7: You should:

  1. check the accuracy score of the learned decision tree with the function accuracy_score() (1 line)
  2. get the classification report with classification_report(), save it in "report" (1 line)
  3. visualize the confusion matrix with ConfusionMatrixDisplay.from_predictions(). Replace the two None inputs with the correct variables. (1 line)
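The evaluation calls can be sketched with small stand-in label arrays; in the notebook, the true labels come from the testing dataset and the predictions from tree_predict().

```python
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical stand-in labels for a quick demonstration.
y_test = ["Approved", "Declined", "Approved", "Declined", "Approved"]
y_pred = ["Approved", "Declined", "Declined", "Declined", "Approved"]

accuracy = accuracy_score(y_test, y_pred)        # fraction of correct predictions
report = classification_report(y_test, y_pred)   # per-class precision/recall/F1
print(accuracy)
print(report)

# In the notebook, the confusion matrix is then visualized with:
# ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
```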

3.7. Printing the Decision Tree

You will use the following function print_tree() to recursively print out the decision tree and to save the result in a file named decisionTree.txt.

You will execute the print_tree() function below to print the tree in a text file.

4. Random Forest

Random Forest is a type of ensemble learning using bootstrap aggregation. Bootstrap aggregation (or bagging) reduces variance (noise) of model predictions by drawing (with replacement) many random training samples of datasets from a large dataset. Random Forest improves over bagging by taking a random fixed-size subset of features in the bootstrapped samples, thus reducing correlation among trees typically produced by bagging.

4.1. Forest Generation

Task 8: you will implement the function generate_forest() to generate a single decision tree in each iteration. You will:

  1. randomly select $m = 10$ features from all $M=27$ features with the function np.random.choice(). Because the $m$ features must be chosen without replacement, please set the parameters size= and replace= to the correct values. (1 line)
  2. randomly select $N=7{,}180$ items, with replacement, from the $N=7{,}180$ data items in the dataset. Please use the sample() function of a pandas DataFrame, setting the inputs replace= and frac= correctly. (1 line)
  3. generate the decision tree with the selected features and data subset with your implemented function generate_tree(). (1 line)
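The two sampling steps can be sketched as below; the feature names and the small stand-in DataFrame are illustrative (the assignment itself draws m = 10 of M = 27 features from data_train).

```python
import numpy as np
import pandas as pd

# Hypothetical feature names and a small stand-in for data_train.
feature_names = ["LTV", "DTI", "AmountRequested", "DownPayment", "Loanterm"]
data_train = pd.DataFrame({name: range(20) for name in feature_names})

m = 3  # the assignment uses m = 10 out of M = 27 features
# 1. feature subset: without replacement, so each feature appears at most once
selected_features = np.random.choice(feature_names, size=m, replace=False)
# 2. bootstrap sample: as many rows as the dataset, drawn with replacement
bootstrap = data_train.sample(frac=1.0, replace=True)
print(list(selected_features), len(bootstrap))
```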

To visualize the structure of the decision trees in a random forest, this assignment also shows how to use DecisionTreeClassifier() from the scikit-learn library to construct a random forest. Because the decision tree implementation in scikit-learn does not accept categorical data, one-hot encoding must be performed; DictVectorizer() is used for the encoding.

4.2. Predictive Function

When you have constructed a random forest with several decision trees (two versions: self-implemented and scikit-learn), your next task is to test the performance of the two versions.

Task 9: The first function forest_predict() will:

  1. transform the result "prediction_list" into the pandas data structure pd.Series (1 line)
  2. get the prediction result by majority voting, using the mode() function of the pandas Series (1 line)
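Majority voting can be sketched as:

```python
import pandas as pd

# One predicted label per tree for a single query record (hypothetical values).
prediction_list = ["Approved", "Declined", "Approved", "Approved", "Declined"]

predictions = pd.Series(prediction_list)  # 1. wrap the list in a pandas Series
majority = predictions.mode()[0]          # 2. most frequent label; mode() returns a Series
print(majority)
```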

The scikit-learn function forest_predict_scikit() is shown below.

4.3. Performance Evaluation

You will generate the random forest with your implemented function generate_forest_scikit() and evaluate its performance with accuracy_score(), confusion_matrix(), and classification_report() using the scikit-learn library.

Task 10: You will

  1. check the accuracy score of the learned random forest with the function accuracy_score() (1 line)
  2. get the classification report with classification_report(), save it in "report" (1 line)
  3. visualize the confusion matrix with ConfusionMatrixDisplay.from_predictions(). Replace the two None inputs with the correct variables. (1 line)

Due to the random sampling used in bagging and in feature sampling, the performance of the random forest varies, and you should obtain an accuracy of approximately 80%. To accelerate the learning procedure, you can change the parameter tree_number. For example, if tree_number=10, it will run much faster, but the performance will vary significantly due to higher variance.

To evaluate the performance of random forest (RF), you will test the predictive accuracy using different numbers of trees in RF, ranging from 10 to 100 with an increment of 10. Every selected tree number (e.g., 10, 20, ... , 50, ... , 100) will be tested 10 times (specified in test_num). Note that executing this part will need 25 minutes or longer, depending on your hardware. So, after starting it, take a walk and get back in half an hour!

The following block includes the code for visualization. The mean and standard deviation are selected to reflect the distribution of the resulting accuracy of the random forest with different tree numbers.

Upon executing the code below, a random forest with 5 decision trees is trained and visualized. When the trees are displayed below, you can double-click to enlarge the following figure and scroll through it to view the details.

5. Prediction of Sampled Data

Finally, to illustrate the predictions of the decision tree and the random forest more intuitively, several sampled data items from the testing dataset are used to predict decisions on automobile loan applications.

6. Marking Scheme and Submission

This part carries 80% of the assignment grade. The Quiz posted on Moodle carries 20%. Late submission will incur a 30% deduction. The marking scheme of this part follows.

Task Summary

Task Grade Points
1. Tree Node (Node()) 6
2. Entropy (entropy()) 10
3. Information Gain (compute_gain()) 10
4. Best Attribute Selection (split_attribute()) 10
5. Tree Generation (generate_tree()) 10
6. Tree Result Prediction (tree_predict()) 8
7. Tree Evaluation 8
8. Forest Generation (generate_forest()) 6
9. Forest Result Prediction (forest_predict()) 4
10. Forest Evaluation 8
TOTAL 80

Submission

You are required to upload to Moodle a zip file containing the following files.

  1. Your completed Jupyter Notebook of this part. Please rename your file as A2A_[SID]_[FirstnameLastname].ipynb (where [SID] is your student ID and [FirstnameLastname] is your first name and last name concatenated) and do not include the data file. You must complete the Acknowledgment section in order for the file to be graded.
  2. The PDF version (.pdf file) of your completed notebook (click File > Download as > PDF via HTML (If error occurs, you may download it as HTML and then save the HTML as PDF separately)).

In addition, please complete A2Q: Assignment 2 -- Quiz separately on the Moodle site.

7. Summary

Congratulations! You have implemented your own Decision Tree (DT) and Random Forest (RF) machine learning algorithms in this course! To summarize, you have implemented the basic data structure "Node" of a decision tree, entropy, information gain, attribute selection, recursive tree generation, forest generation, and the evaluation of the decision tree and random forest. You have run the algorithms to learn the DT and RF models from the training dataset, evaluated the performance of the models using the testing dataset, and applied the models to predicting loan decisions on sampled data.